Regularized Discriminant Analysis Incorporating Prior Knowledge on Gene Functional Groups
Authors
Abstract
In the last decade, renewed interest in discriminant analysis has been driven primarily by possible applications to tumor classification using high-dimensional microarray-based data. This thesis makes three contributions:
1. We introduce a new regularized covariance estimation procedure that we refer to as SHIP (SHrinking and Incorporating Prior knowledge). The resulting covariance estimator is based on the shrinkage estimator of Ledoit and Wolf [31, 33, 32] but additionally incorporates prior knowledge on gene functional groups extracted from the KEGG database. We develop multiple options for integrating this knowledge into the shrinkage estimator. Instead of using a standard cross-validation procedure to determine the optimal shrinkage intensity, we determine it analytically, as introduced by Ledoit and Wolf.
2. We propose a variant of regularized linear discriminant analysis (LDA) that carries the idea of the shrinkage estimator above over to LDA.
3. We apply our method to public gene expression data sets and examine its classification performance in both the binary and the c-nary case, where c > 2. Diagonal linear discriminant analysis and the nearest shrunken centroids method [15] serve as competitors.
We show that rlda.TG, one of our variants of LDA 'via the SHIP', performs well in all classification problems and even outperforms the competitors, albeit marginally, in some situations. Unexpectedly, we find that another variant of LDA, which is based on the shrinkage estimator of Ledoit and Wolf and does not incorporate any biological knowledge, is as competitive as rlda.TG.
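To make the idea of shrinking toward a knowledge-based target more concrete, the following Python sketch may help. It is not the thesis's SHIP estimator or its rlda.TG variant: the target used here (keep sample covariances for gene pairs in the same group, set all other off-diagonal entries to zero) is a simplified assumption, and the function names `shrink_cov_groups`, `fit_rlda`, and `predict_rlda` are hypothetical. The shrinkage intensity is computed analytically rather than by cross-validation, using the usual ratio of estimated variances of the shrunk entries to their squared values, and the pooled covariance is formed from class-centered data with an n/(n-1) scaling for simplicity.

```python
import numpy as np

def shrink_cov_groups(X, groups):
    """Shrink the sample covariance of X (n samples x p genes) toward a
    group-informed target. `groups` holds one (hypothetical) label per gene."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # Per-sample products w_kij = (x_ki - mean_i) * (x_kj - mean_j)
    W = Xc[:, :, None] * Xc[:, None, :]
    W_bar = W.mean(axis=0)
    S = n / (n - 1) * W_bar                      # unbiased sample covariance
    var_S = n / (n - 1) ** 3 * ((W - W_bar) ** 2).sum(axis=0)
    groups = np.asarray(groups)
    same = groups[:, None] == groups[None, :]    # True on the diagonal too
    T = np.where(same, S, 0.0)                   # assumed target, not the thesis's
    cross = ~same                                # only these entries get shrunk
    denom = (S[cross] ** 2).sum()
    # Analytic intensity, clipped to [0, 1]; 0 if nothing is shrunk
    lam = 0.0 if denom == 0 else float(np.clip(var_S[cross].sum() / denom, 0.0, 1.0))
    return lam * T + (1.0 - lam) * S, lam

def fit_rlda(X, y, groups):
    """Fit LDA with the shrunken pooled covariance plugged in."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    means = np.stack([X[y == k].mean(axis=0) for k in classes])
    priors = np.array([(y == k).mean() for k in classes])
    centered = X - means[np.searchsorted(classes, y)]
    sigma, lam = shrink_cov_groups(centered, groups)
    # The pseudo-inverse guards against a singular estimate when p is large
    precision = np.linalg.pinv(sigma)
    return classes, means, priors, precision, lam

def predict_rlda(model, X):
    """Assign each row of X to the class with the largest linear score."""
    classes, means, priors, precision, _ = model
    A = means @ precision
    scores = X @ A.T - 0.5 * np.sum(A * means, axis=1) + np.log(priors)
    return classes[np.argmax(scores, axis=1)]

# Toy data: 6 genes in two made-up functional groups, two classes
rng = np.random.default_rng(0)
groups = [0, 0, 0, 1, 1, 1]
y = np.repeat([0, 1], 40)
X = rng.normal(size=(80, 6))
X[y == 1] += 2.0
model = fit_rlda(X, y, groups)
pred = predict_rlda(model, X)
```

The same code handles more than two classes, since the score is computed per class and the largest one wins. In a real analysis the group labels would come from a pathway database such as KEGG, and a gene may belong to several groups, which this sketch does not handle.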
Similar resources
SHrinkage Covariance Estimation Incorporating Prior Biological Knowledge with Applications to High-Dimensional Data
In “-omic data” analysis, information on the structure of covariates is broadly available, either from public databases describing gene regulation processes and functional groups, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), or from statistical analyses, for example in the form of partial correlation estimators. The analysis of transcriptomic data might benefit from the incorporation...
Incorporating prior knowledge of gene functional groups into regularized discriminant analysis of microarray data
MOTIVATION Discriminant analysis for high-dimensional and low-sample-sized data has become a hot research topic in bioinformatics, mainly motivated by its importance and challenge in applications to tumor classifications for high-dimensional microarray data. Two of the popular methods are the nearest shrunken centroids, also called predictive analysis of microarray (PAM), and shrunken centroids...
Over-optimism in bioinformatics: an illustration
MOTIVATION In statistical bioinformatics research, different optimization mechanisms potentially lead to 'over-optimism' in published papers. So far, however, a systematic critical study concerning the various sources underlying this over-optimism is lacking. RESULTS We present an empirical study on over-optimism using high-dimensional classification as example. Specifically, we consider a 'p...
Bayesian Quadratic Discriminant Analysis
Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. A distribution-based Bayesian classifier is derived using information geometry. Using a calculus of variations approach to define a functional Bre...
Lung cancer gene expression database analysis incorporating prior knowledge with support vector machine-based classification method
BACKGROUND A reliable and precise classification is essential for successful diagnosis and treatment of cancer. Gene expression microarrays have provided the high-throughput platform to discover genomic biomarkers for cancer diagnosis and prognosis. Rational use of the available bioinformation can not only effectively remove or suppress noise in gene chips, but also avoid one-sided results of s...